Abstract

A basic assumption of statistical learning theory is that 
train and test data are drawn from the same underlying dis- 
tribution. Unfortunately, this assumption doesn't hold in 
many applications. Instead, ample labeled data might exist 
in a particular 'source' domain while inference is needed 
in another, 'target' domain. Domain adaptation methods 
leverage labeled data from both domains to improve classification
on unseen data in the target domain. In this work we survey domain
transfer learning methods for various application domains, with a
focus on recent work in computer vision.

1. Introduction 

The shortage of labeled data is a fundamental problem 
in applied machine learning. While huge amounts of unlabeled data
are constantly being generated and made available in many domains,
the cost of acquiring data labels remains high. Even worse, the
situation sometimes makes it highly impractical or even impossible
to acquire labelled data (e.g. when the underlying distribution is
constantly changing).

Domain adaptation (sometimes referred to as domain transfer
learning) approaches this problem by leveraging labelled data in a
related domain, hereafter referred to as the 'source' domain, when
learning a classifier for unseen data in a 'target' domain. The
domains are assumed to be related, but not identical (if they were
identical, this would be a standard machine learning problem).

This situation occurs in many domains. A few examples are: event
detection across video corpora from different domains (e.g.
different TV stations), named entity recognition across different
text corpora (e.g. a sports text corpus and a news corpus), and
object recognition in images acquired in different domains (webcam
versus Amazon stock photos).

Domain adaptation (DA) only recently started receiving significant
attention [Daumé III, 2009, Chelba and Acero, 2004, Daumé III and
Marcu, 2006], in particular for computer vision applications
[Saenko et al., 2010, Kulis et al., 2011, Gopalan et al., 2011,
Jhuo et al., 2012, Duan et al., 2009, Bergamo and Torresani, 2010],
although related fields, such as covariate shift [Shimodaira,
2000], have a longer history. It is perhaps indicative of the field
being so new that the proposed methods have such different
characteristics. To the best of our knowledge, there is only one
previous survey of domain adaptation [Jiang, 2008], which focused
on learning theory and natural language processing applications.
In addition, [Pan and Yang, 2009] provide a thorough survey of the
related field of transfer learning.

1.1. Related Fields 

As mentioned in the introduction, the shortage of labeled data is
a fundamental problem for applied machine learning. It is
important enough that several areas of research are devoted to
various aspects of this problem. In the active learning paradigm,
labels are acquired in an interactive fashion to maximize the
benefit of each new label [Kapoor et al., 2007]. Related approaches
include [Branson et al., 2011], where a 'human-in-the-loop'
determines which labels to update, thus making the 'most' out of
the acquired labels. Crowdsourcing through, e.g., Amazon Mechanical
Turk (mTurk) allows for rapid collection of large amounts of
labels, and much research is devoted to the efficient distribution
of tasks and the interpretation and weighting of retrieved labels
[Welinder et al., 2010]. Further areas include weakly supervised
methods, e.g. multiple instance learning [Dietterich et al., 1997]
or latent structural SVMs [Yu and Joachims, 2009], where the level
of supervision is lower than the given task demands. Other
approaches include semi-supervised learning, which makes use of
small amounts of labelled data together with large amounts of
unlabeled data. Co-training [Blum and Mitchell, 1998] is a notable
and popular approach.

More closely related to domain adaptation is transfer learning
[Pan and Yang, 2009]. In transfer learning (TL) the marginal
distributions of the source and target data are similar but
different tasks are considered. To make this problem tractable, a
common prior on the model parameters across tasks is typically
assumed. A computer vision example is 'one-shot learning' [Fei-Fei
et al., 2006], where new visual categories are learned using a
single training example by leveraging data from other labelled
categories. This is different from domain adaptation, where the
marginal data distributions of source and target are different,
but the task is similar.

Another related field is model adaptation. Here, unlabeled data,
sometimes referred to as background data or auxiliary data, is used
to regularize the class-specific models. This paradigm has had
much success in speaker verification [Reynolds et al., 2000], and
has also been applied to computer vision problems [Dixit et al.,
2011, L. Fei-Fei, 2007]. The methods used in this field, such as
Adapted Gaussian Mixture Models [Reynolds et al., 2000], could
trivially be used in a domain transfer setting by discarding the
source data labels and letting the source data constitute the
background model.

Cross-modal classification and retrieval make very similar
assumptions on the data compared to DA, but assume an
instance-level, rather than class-level, relationship between the
domains. In cross-modal classification, ample data from both
domains is available at train time, and the unknown sample can
come from any of the modalities.

2. Setup 

In this section we introduce notation and provide an
overview of the paper.

2.1. Notation 

Let $X$ denote the input, and $Y$ the output random variable. Let
$P(X, Y)$ denote the joint probability distribution of $X$ and $Y$.
Similarly, let $P(X)$ and $P(Y)$ denote the marginal probability
distributions. In the domain adaptation scenario, as mentioned in
the introduction, we have two distinct distributions. Let
$P_s(X, Y)$ denote the source distribution where, typically, we
have access to ample labelled data, and let $P_t(X, Y)$ be the
target distribution that we seek to estimate. We also let
$P(X = x, Y = y) = P(x, y)$ refer to the joint probability of a
specific outcome, thus differentiating it from $P(X, Y)$, which
represents the probability distribution.

Data is available from three sets: labelled data from the source
domain, $S_l = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$,
$x^s \in \mathbb{R}^{d_s}$, drawn from a joint source probability
distribution $P_s(X, Y)$; labelled data from the target domain,
$T_l = \{(x_i^{t,l}, y_i^t)\}_{i=1}^{N_{t,l}}$,
$x^{t,l} \in \mathbb{R}^{d_t}$, drawn from a joint target
distribution $P_t(X, Y)$; and unlabeled data,
$T_u = \{x_i^{t,u}\}_{i=1}^{N_{t,u}}$,
$x^{t,u} \in \mathbb{R}^{d_t}$, from a marginal target distribution
$P_t(X)$. It is commonly assumed that $N_s \gg N_{t,l}$ and
$d_s = d_t$. The target and source labels are generally assumed to
belong to the same space, e.g. for the $k$-class classification
task, $y^s, y^t \in \mathbb{Z}_k$. We further let $D^i$ be the data
matrix for domain $i$, with one data sample per column,
$i \in \{t, s\}$.

The goal of domain adaptation (DA) can thus be summarized as that
of learning a function $y = f(x_u \mid \mathcal{D})$ that predicts
the class, $y_u$, of an unseen sample from the target domain with
high probability, $P_t(Y = y \mid X = x_u)$. The training set
$\mathcal{D}$ differs depending on the data assumptions. In
supervised DA, $\mathcal{D} = S_l \cup T_l$; in unsupervised DA,
$\mathcal{D} = S_l \cup T_u$; and in semi-supervised DA,
$\mathcal{D} = S_l \cup T_l \cup T_u$.

2.2. Overview 

As mentioned in the introduction, domain adaptation is a
relatively new field. It is also relatively loosely defined with
regard to, e.g., how 'related' the domains are and how 'few'
labelled samples exist in the target domain. Also, the general
problem statement applies to several application domains, such as
natural language processing and computer vision. For all these
reasons, there is a wide variety of proposed methods. Inspired by
the categorization proposed in [Jiang, 2008], we begin by
considering instance weighting methods for relaxations of the DA
assumptions in Sec. 3. We consider methods utilizing the source
data to regularize target models in Sec. 4. We then survey methods
seeking a common representation across domains in Sec. 5. Section 6
makes connections to transfer learning, and Sec. 7 briefly surveys
methods for multi-modal learning.

3. Instance Weighting 

Following [Jiang, 2008], we first consider two relaxations of the
DA problem. For the analysis we will use the empirical risk
minimization framework proposed by [Vapnik, 1999] for standard
supervised data. Here we let $\theta \in \Theta$ be a model
parameter from a given parameter space, and $\theta^*$ be the
optimal parameter choice for the distribution $P(X, Y)$. Let
further $l(x, y, \theta)$ be a loss function. In this framework we
seek

$$ \theta^* = \arg\min_{\theta \in \Theta} \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} P(x, y)\, l(x, y, \theta). \tag{1} $$

$P(X, Y)$ is unknown but we can estimate it with the empirical
distribution, $\tilde{P}(X, Y)$:

$$ \hat{\theta} = \arg\min_{\theta \in \Theta} \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} \tilde{P}(x, y)\, l(x, y, \theta) \tag{2} $$
$$ = \arg\min_{\theta \in \Theta} \sum_{i=1}^{N} l(x_i, y_i, \theta). \tag{3} $$

[Jiang, 2008] extends this to the DA problem and arrives at the
following formulation:

$$ \theta^* = \arg\min_{\theta \in \Theta} \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} P_t(x, y)\, l(x, y, \theta) \tag{4} $$
$$ \approx \arg\min_{\theta \in \Theta} \sum_{i=1}^{N_s} \frac{P_t(x_i^s, y_i^s)}{P_s(x_i^s, y_i^s)}\, l(x_i^s, y_i^s, \theta). \tag{5} $$
We see that weighting the loss of each (source) training sample by
$P_t(x_i^s, y_i^s) / P_s(x_i^s, y_i^s)$ provides a solution that is
consistent with the empirical risk minimization framework. Clearly,
if we had a good estimate of $P_t(X, Y)$ we would already be done,
so this does not help us directly, but the formulation is useful
for the discussion below. In the following we consider two
relaxations of the DA problem formulation: class imbalance,
$P_t(X \mid Y = y) = P_s(X \mid Y = y)$, and covariate shift,
$P_t(Y \mid X) = P_s(Y \mid X)$.

3.1. Class Imbalance 

One way to relax the DA problem formulation is to assume
$P_t(X \mid Y) = P_s(X \mid Y)$, but $P_t(Y) \neq P_s(Y)$. This is
called class imbalance, population drift or sampling bias.
Consider, for example, training data sampled from a remote sensing
application. Test data collected at a later occasion may have a
different class distribution due to a changed landscape. Taking the
assumptions into account, the ratio $P_t(x, y) / P_s(x, y)$ becomes

$$ \frac{P_t(x, y)}{P_s(x, y)} = \frac{P_t(y)}{P_s(y)} \frac{P_t(x \mid y)}{P_s(x \mid y)} \tag{6} $$
$$ = \frac{P_t(y)}{P_s(y)}, \tag{7} $$

and we only need to consider $P_t(y) / P_s(y)$. This approach was
explored in [Lin et al., 2002]. We can also re-sample the data to
make the class distributions equal.
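The class-prior reweighting of Eq. (7) can be illustrated with a small sketch. This is only a toy illustration under stated assumptions, not the procedure of [Lin et al., 2002]: the target prior is assumed to be known and is passed in as a dictionary, the source prior is estimated from label frequencies, and the function names `class_prior_weights` and `weighted_risk` are hypothetical.

```python
from collections import Counter


def class_prior_weights(source_labels, target_priors):
    """Return the weight P_t(y) / P_s(y) for every source sample.

    source_labels: class labels observed in the source domain.
    target_priors: dict mapping label -> assumed target prior P_t(y).
    """
    counts = Counter(source_labels)
    n = len(source_labels)
    # Empirical source prior P_s(y), estimated from label frequencies.
    source_priors = {y: c / n for y, c in counts.items()}
    return [target_priors[y] / source_priors[y] for y in source_labels]


def weighted_risk(losses, weights):
    """Importance-weighted empirical risk (Eq. (5), up to a constant factor)."""
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


# Hypothetical toy data: the source is 75% class "a"; the target is
# assumed (not estimated) to be balanced.
labels = ["a", "a", "a", "b"]
weights = class_prior_weights(labels, {"a": 0.5, "b": 0.5})
```

In this toy case the samples of the over-represented class are down-weighted and the single sample of the rare class is up-weighted, so that the weighted source sample mimics the assumed target class balance.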

3.2. Covariate Shift 

Covariate shift [Shimodaira, 2000] is another relaxation of DA.
Here, given an observation, the class distributions are the same in
the source and target domains, but the marginal data distributions
are different: $P_t(Y \mid X) = P_s(Y \mid X)$, but
$P_t(X) \neq P_s(X)$. This situation arises, for example, in active
learning, where samples from $P_s(X)$ tend to be biased to lie near
the margin of the classifier. At first glance, this situation
appears not to present a problem, since
$P_t(Y \mid X) = P_s(Y \mid X)$, which we can estimate from the
data. It becomes a problem in practice when the model family we use
is mismatched to the data, i.e. when no choice of parameters makes
the model fit the underlying distribution. The optimal fit to the
source data will then minimize the model error in the dense areas
of $P_s(X)$, because these areas dominate the error. Since
$P_t(X)$ is different from $P_s(X)$, the learned model will not be
optimal for the target data.

As in the previous section, $P_t(x, y) / P_s(x, y)$ can be
simplified under these assumptions:

$$ \frac{P_t(x, y)}{P_s(x, y)} = \frac{P_t(x)}{P_s(x)} \frac{P_t(y \mid x)}{P_s(y \mid x)} \tag{8} $$
$$ = \frac{P_t(x)}{P_s(x)}. \tag{9} $$

Figure 1. The MEGA model proposed in [Daumé III and Marcu, 2006].
This model assumes the data is in fact generated by three
distributions: a target, a common and a source distribution. The
MEGA model learns a classifier for each space. Left is the standard
logistic regression model.

Again, a well-founded solution can be identified by appropriate
instance weighting of the loss function. [Shimodaira, 2000]
explored this approach and showed that the weighted model better
estimates the data given a biased sampling function. The quantity
$P_t(x) / P_s(x)$ can be estimated using, e.g., non-parametric
kernel estimation [Sugiyama and Müller, 2005, Shimodaira, 2000].
[Huang et al., 2007] proposed to estimate the ratio directly,
rather than estimating the two distributions separately. They use
the Kernel Mean Match,

$$ \left\| \frac{1}{N_s} \sum_{i=1}^{N_s} \beta_i \Phi(x_i^s) - \frac{1}{N_t} \sum_{i=1}^{N_t} \Phi(x_i^t) \right\|, \tag{10} $$

a metric that measures the distribution distance in a Reproducing
Kernel Hilbert Space, where $\beta_i$ are the sample weights and
$\Phi$ is the feature map.
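As a rough illustration of the non-parametric route to the ratio in Eq. (9), the sketch below estimates $P_t(x)$ and $P_s(x)$ separately with Gaussian kernel density estimates and divides them. It assumes one-dimensional inputs and a hand-picked bandwidth, and the helper names are hypothetical. It is not Kernel Mean Matching, which solves for the weights directly.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built from `samples`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def covariate_shift_weights(source_x, target_x, bandwidth=0.5):
    """Estimate the ratio P_t(x) / P_s(x) at every source sample."""
    p_s = gaussian_kde(source_x, bandwidth)
    p_t = gaussian_kde(target_x, bandwidth)
    # p_s(x) is strictly positive at a source sample, since the kernel
    # centred on that sample contributes to the estimate.
    return [p_t(x) / p_s(x) for x in source_x]


# Hypothetical toy data: source inputs cluster near 0, target inputs near 2.
source_x = [0.0, 0.2, 0.4, 2.0]
target_x = [1.8, 2.0, 2.2, 2.4]
weights = covariate_shift_weights(source_x, target_x)
```

The source sample at 2.0 lies where the target data is dense and receives a weight above one, while the samples near 0 are almost ignored. Density estimation of this kind degrades quickly in higher dimensions, which is one motivation for estimating the ratio directly.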

4. Source Distribution As Prior 

Often, the simplifying assumptions of the previous section do not
hold. This section discusses methods that use prior probabilities
estimated on the source data to regularize the model. We first
cover priors in the Bayesian sense, and then some examples of
discriminative methods.

4.1. Bayesian Priors 

Maximum a posteriori (MAP) estimation of model parameters is
central in Bayesian statistics. In this setting prior knowledge
about the model can be incorporated in a prior probability of the
parameters, $P(\theta)$. Specifically, instead of finding optimal
parameters $\theta^*$ as

$$ \theta^* = \arg\max_{\theta} \prod_{i=1}^{N} P(y_i \mid x_i, \theta), \tag{11} $$

one solves

$$ \theta^* = \arg\max_{\theta} P(\theta) \prod_{i=1}^{N} P(y_i \mid x_i, \theta). \tag{12} $$

In domain adaptation we can estimate the prior probability from
the source domain as

$$ \theta^* = \arg\max_{\theta} P(\theta \mid S_l) \prod_{i=1}^{N_{t,l}} P(y_i^t \mid x_i^t, \theta). \tag{13} $$

Figure 2. Figure from [Reynolds et al., 2000] illustrating the
adapted GMM model. The left figure (a) shows the universal GMM
estimated from the background data together with the
speaker-specific train data. The right figure (b) shows the adapted
model.



[Chelba and Acero, 2004] pursued this approach in adapting a
maximum entropy capitalizer. [Daumé III and Marcu, 2006] argued
that this two-step process (first estimating $P(\theta)$ from
$S_l$ and then estimating $\theta$) was non-intuitive and suggested
an ensemble model that considers three classifiers simultaneously:
one for the target, one for the source and one for the joint
portion of the data. This generative model, which they denote
MEGA, is shown in Fig. 1.
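The two-step use of a source-derived prior can be made concrete with a deliberately small sketch. It assumes a one-parameter linear model $y = \theta x$ with Gaussian noise and a Gaussian prior on $\theta$ centred on the source estimate, so that the MAP solution of Eq. (12) has a closed form. The model, the prior variance and the function names are illustrative assumptions, not the maximum entropy setup of [Chelba and Acero, 2004].

```python
def least_squares_slope(xs, ys):
    """Maximum-likelihood slope of y = theta * x under Gaussian noise."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def map_slope(xs, ys, prior_mean, prior_var, noise_var=1.0):
    """MAP slope under a Gaussian prior N(prior_mean, prior_var) on theta."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    numerator = sxy / noise_var + prior_mean / prior_var
    denominator = sxx / noise_var + 1.0 / prior_var
    return numerator / denominator


# Hypothetical toy data: ample source samples, a single target sample.
source_x, source_y = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
target_x, target_y = [1.0], [3.0]

# Step 1: estimate the prior mean from the source domain.
theta_source = least_squares_slope(source_x, source_y)
# Step 2: MAP estimation on the target data under that prior.
theta_target_ml = least_squares_slope(target_x, target_y)
theta_map = map_slope(target_x, target_y, prior_mean=theta_source, prior_var=1.0)
```

With a single target sample, the MAP estimate lands between the source estimate and the target-only maximum-likelihood estimate. A smaller prior variance pulls it further towards the source.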

When $P(\theta \mid S)$ is estimated with unlabeled data, the
problem is technically no longer domain adaptation but rather model
adaptation. Adapted Gaussian Mixture Models have successfully been
applied to speaker verification [Reynolds et al., 2000, W. M.
Campbell, 2006], and recently also to computer vision [Dixit et
al., 2011]. Figure 2 shows a schematic illustration of an adapted
GMM.

4.2. Discriminative Priors 

In this section we survey work that investigates modifying the
support vector machine (SVM) algorithm for the domain adaptation
problem. These methods typically use an already-trained SVM in the
source domain as input to subsequent training. The source data is
thus used to regularize the output model in a similar way as in
Sec. 4.1.

[Yang et al., 2007] propose the adaptive support vector machine
(ASVM). The basic idea is to learn a new decision boundary that is
close to that learned in the source domain. The source data thus
acts as a regularizer on the final model. This method presumes the
existence of an SVM model $f_s(x)$ trained on the source domain
data. They let the final decision function, $f(x)$, be the sum of
$f_s(x)$ and $w^T \phi(x)$. The final classifier is attained by
solving the following constrained optimization problem:

$$ \min_{w} \frac{1}{2} \|w\|^2 + C \sum_{i} \xi_i $$
$$ \text{s.t.} \quad \xi_i \geq 0, \quad y_i \left( f_s(x_i) + w^T \phi(x_i) \right) \geq 1 - \xi_i. $$

One problem with this formulation is that it does not strive for a
large margin, but rather for a solution close to the source
solution. This is only reasonable for situations where
$P_t(X, Y)$ is similar to $P_s(X, Y)$. To address this, [Jiang et
al., 2008] proposed the Cross-Domain SVM (CDSVM). CDSVM relaxes the
constraint that the final model needs to be similar to the old one
by only enforcing proximity where the support vectors of $f_s(x)$
are close to any of the target data. They do this by introducing
additional constraints that the old support vectors, just like the
target data points, should be correctly classified. These
constraints are only active when the old support vectors are close
to any part of the target data. Specifically, they solve the
following constrained optimization problem:

$$ \min_{w} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{|T_l|} \xi_i + C \sum_{j=1}^{M} \sigma(v_j^s, T_l)\, \bar{\xi}_j $$
$$ \text{s.t.} \quad y_i (w^T \phi(x_i) - b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \forall (x_i, y_i) \in T_l, $$
$$ y_j^s (w^T \phi(v_j^s) - b) \geq 1 - \bar{\xi}_j, \quad \bar{\xi}_j \geq 0, \quad \forall (v_j^s, y_j^s) \in V_s. $$

Here $v_j^s \in V_s$ are the support vectors of $f_s(x)$ with
signs $y_j^s$. The authors let $\sigma(v_j^s, T_l)$ be a Gaussian
that determines which vectors are close. [Bergamo and Torresani,
2010] provide a survey of other SVM-based DA methods.

5. Common representation 

Perhaps the most intuitive way to do domain adaptation is to
create a feature map such that the source and target distributions
are aligned. In other words, we seek functions $g_t(X)$ and
$g_s(X)$ for which

$$ P_t(Y = k \mid g_t(X) = x) = P_s(Y = k \mid g_s(X) = x) \quad \forall (k, x) \in (\mathcal{Y} \times \mathcal{X}), \tag{14} $$
where the functions $g_t(X)$ and $g_s(X)$ might be equal, related
or even the identity, depending on the method. [Jiang, 2008] notes
that the entropy of $Y \mid g(X)$ is likely to increase compared to
$Y \mid X$, since the feature representation usually is simpler
after the mapping and thus encodes less information. This means
that the Bayes error is likely to increase, and a good algorithm
for domain alignment should take this into account. A simple and
straightforward way of aligning the domains is feature selection.
[Satpal and Sarawagi, 2006] proposed a method that removes
features to minimize an approximated distance function between the
source and target distributions. Specifically, they minimize
$\sum_{k \in K} d(E_{S_l}^k, E_{T_u}^k)$, where

$$ E_{S_l}^k = \sum_{(x_i, y_i) \in S_l} \frac{f_k(x_i, y_i)}{N}, \tag{15} $$
$$ E_{T_u}^k = \sum_{x_i \in T_u} \sum_{y} f_k(x_i, y)\, P(y \mid x_i, w) \tag{16} $$

are the expectations of feature $k$ for $S_l$ and $T_u$. The
objective function is

$$ \arg\max_{w} \sum_{(x, y) \in S_l} \sum_{k \in K} w_k f_k(x, y) - \log z_w(x) \tag{17} $$



k£K 

[Duan et al., 2009] took another approach towards the same goal.
They follow [Borgwardt et al., 2006] and use the Maximum Mean
Discrepancy (MMD) criterion to compare data distributions based on
the distance between the means of samples from the two domains in
the Reproducing Kernel Hilbert Space (RKHS),

$$ \mathrm{dist}_k(D^s, D^t) = \left\| \frac{1}{N_s} \sum_{i=1}^{N_s} \phi(x_i^s) - \frac{1}{N_t} \sum_{i=1}^{N_t} \phi(x_i^t) \right\|. \tag{19} $$



The authors integrate this distance with the standard SVM loss
function

$$ [k^*, f^*] = \arg\min_{k, f} \Omega(\mathrm{dist}_k(D^s, D^t)) + \alpha\, \mathrm{SVM}_{k,f}(D), \tag{20} $$

thus jointly finding (1) a kernel that minimizes
$\mathrm{dist}_k(D^s, D^t)$ and (2) an SVM decision function,
$\mathrm{SVM}_{k,f}$, that separates the data in kernel space. To
make this tractable they iteratively solve for (the parameters of)
a parameterized mixture of kernel functions and (the parameters of)
the SVM loss function. They show improvements on the TRECVID
dataset over related approaches.
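The MMD criterion does not require the RKHS feature map explicitly: expanding the squared norm of the difference of the two sample means leaves only kernel evaluations. The sketch below computes this (biased) empirical estimate with a Gaussian RBF kernel. The kernel choice, the `gamma` value and the function names are assumptions made for illustration, and the sketch does not include the kernel learning or the SVM term of [Duan et al., 2009].

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel between two equal-length feature vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source, target, gamma=1.0):
    """Biased empirical estimate of the squared MMD between two samples."""
    return (
        mean_kernel(source, source, gamma)
        + mean_kernel(target, target, gamma)
        - 2.0 * mean_kernel(source, target, gamma)
    )


# Hypothetical toy samples: one target set identical to the source,
# one shifted far away from it.
source = [[0.0, 0.0], [0.1, 0.0]]
near_target = [[0.0, 0.0], [0.1, 0.0]]
far_target = [[3.0, 3.0], [3.1, 3.0]]

mmd_near = mmd_squared(source, near_target)
mmd_far = mmd_squared(source, far_target)
```

The estimate vanishes when the two samples coincide and grows as the samples separate, which is the behaviour a kernel-selection procedure based on MMD relies on.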

[Blitzer et al., 2006] proposed Structural Correspondence Learning
(SCL). SCL finds a feature representation that maximizes the
correspondence between unlabeled data in the source and target
domains by leveraging pivot features that behave similarly in both
domains. For example, if the word on the right is 'required' then
the query word is likely a noun. This, then, helps disambiguate
words such as 'signal', which can be both a noun and an adjective.
This algorithm works on unlabeled data, and therefore does not
maximize the correspondence between $P_t(Y \mid g_t(X))$ and
$P_s(Y \mid g_s(X))$ directly, but rather on related tasks.

Several recent papers from the computer vision community pursue
this idea. [Saenko et al., 2010], and later [Kulis et al., 2011],
proposed variations on a metric learning formulation, in which they
learn a mapping that not only aligns the feature spaces but also
maximizes class separation.



$$ \min_{W} r(W) + \lambda \sum_{i,j} c_{i,j}\left( (D^t)^T W D^s \right). \tag{21} $$



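Here $D^t$ and $D^s$ are the target and source (labelled) data
matrices respectively, with one sample per row. Saenko et al. chose
$r(W)$ and $c_{i,j}(\cdot)$ as

$$ r(W) = \mathrm{tr}(W) - \log\det(W), $$
$$ c_{i,j} = \begin{cases} \|x_i^t, x_j^s\|_W \leq u & \text{if } y_i^t = y_j^s, \\ \|x_i^t, x_j^s\|_W \geq l & \text{if } y_i^t \neq y_j^s, \end{cases} \tag{22} $$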



where $\|a, b\|_Q$ is the Mahalanobis distance between $a$ and $b$
with respect to matrix $Q$. With these constraints this formulation
is known as information theoretic metric learning (ITML) [Davis et
al., 2007], and the algorithmic contribution of the paper is to
enforce that each pair of data points consists of one point from
the source and one from the target domain. They state that this is
crucial to ensure that a domain transfer transform is learned.

The authors note that since $\log\det(W)$ is only defined for
positive definite matrices, one can decompose $W$ as $W = L^T L$.
The mapping, therefore, is symmetric since
$(D^t)^T W D^s = (D^t)^T L^T L D^s = (L D^t)^T (L D^s)$. [Kulis et
al., 2011] address this by changing the regularizer to the squared
Frobenius norm. They also changed the constraints to encode
similarity of data samples rather than the Mahalanobis distance.
The new formulation becomes

$$ r(W) = \|W\|_F^2, $$
$$ c_{i,j} = \begin{cases} \left( \max(0,\, x_i^{tT} W x_j^s - u) \right)^2 & \text{if } y_i^t = y_j^s, \\ \left( \max(0,\, l - x_i^{tT} W x_j^s) \right)^2 & \text{if } y_i^t \neq y_j^s. \end{cases} \tag{23} $$

Kulis et al. show how to kernelize this formulation. Their method
shows minor improvements on the 'Saenko Items' dataset (Table 2)
compared to [Saenko et al., 2010, Daumé III, 2009] and baseline
methods.

[Jhuo et al., 2012] very recently proposed a formulation where the
goal is to map the source data, by a matrix
$W \in \mathbb{R}^{d \times d}$, to an intermediate representation
where each transformed sample can be reconstructed by a linear
combination of the target data samples,

$$ W D^s = D^t Z, \tag{24} $$

where $Z \in \mathbb{R}^{N_t \times N_s}$. They propose the
following formulation to solve for low-rank solutions:

$$ \min_{W, Z, E} \mathrm{rank}(Z) + \alpha \|E\|_{2,1} $$
$$ \text{s.t.} \quad W D^s = D^t Z + E, \quad W W^T = I, \tag{25} $$

where $E$ is the reconstruction error matrix. To solve this problem
they relax the rank constraint to the nuclear norm and then apply
a version of the Augmented Lagrange Multiplier (ALM) method [Lin et
al., 2010].

Figure 3. Figure from [Gopalan et al., 2011] illustrating the
proposed method.

Another recent method for computer vision also proposes a mapping
to a common representation [Gopalan et al., 2011]. Motivated by
incremental learning, they create intermediate representations
between the source and target data by viewing the generative
subspaces created from these domains as points on a Grassmannian
manifold. Intermediate representations can then be recovered by
sampling the geodesic path. The final feature representation is a
stacked feature vector, with one part from each sampled location
along the path. They use partial least squares to learn a model on
this extended feature representation. Table 1 shows the evaluation
of this and several other discussed methods.

6. Transfer Learning 

As mentioned in the introduction, transfer learning, sometimes
called multi-task learning, is different from DA. In transfer
learning (TL) the joint probabilities of the tasks,
$\{P(Y_k, X)\}_{k=1}^{K}$, are different but there is only one
marginal data distribution $P(X)$. Normally, the state spaces of
the tasks are assumed to be different, e.g.
$\Omega(Y_1) \neq \Omega(Y_2)$. When learning class conditional
models, $\{P(Y_k \mid X, \theta_k)\}_{k=1}^{K}$, a common prior
distribution of the parameters,
$\theta_1, \ldots, \theta_K \sim P_\Theta(\theta)$, is typically
assumed.

DA, while formally different, can be thought of as a special case
of transfer learning with two tasks, one on the source and one on
the target, where $\Omega(Y_s) = \Omega(Y_t)$.

The classic paper by [Daumé III, 2009] can be viewed in this
framework. Daumé proposes a simple feature space augmentation by

$$ \Phi_s(x) = (x, x, 0) \tag{26} $$
$$ \Phi_t(x) = (x, 0, x) \tag{27} $$

This 'frustratingly easy' method shows promising performance on
named-entity recognition on several text datasets. Under a linear
classification algorithm, this is equivalent to decomposing the
model parameters for class $k$ as $\sigma_c + \sigma_k$, where
$\sigma_c$ is shared by all domains. This formulation is basically
identical to the one proposed by [Evgeniou and Pontil, 2004] for
the purpose of transfer learning. [Daumé III, 2009] provides a
different analysis in the paper, arguing the similarity to the
method of [Chelba and Acero, 2004].
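The augmentation of Eqs. (26) and (27) is short enough to write out directly. The sketch below is a minimal illustration with hypothetical names (`augment`, `dot`) and a hand-picked weight vector. It also checks the parameter decomposition numerically: a linear model on the augmented source features scores a sample exactly as the sum of a shared weight block and a source-specific block applied to the original features.

```python
def augment(x, domain):
    """Map x to (x, x, 0) for the source domain and (x, 0, x) for the target."""
    zeros = [0.0] * len(x)
    if domain == "source":
        return list(x) + list(x) + zeros
    if domain == "target":
        return list(x) + zeros + list(x)
    raise ValueError("domain must be 'source' or 'target'")


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


x = [1.0, 2.0]
phi_s = augment(x, "source")
phi_t = augment(x, "target")

# Hypothetical weights, laid out as (shared, source-specific, target-specific).
w_common, w_source, w_target = [0.5, 0.5], [1.0, 0.0], [0.0, 2.0]
w = w_common + w_source + w_target

score_source = dot(w, phi_s)
score_target = dot(w, phi_t)
# The same scores, written with the decomposed parameters on the raw features.
decomposed_source = dot([c + s for c, s in zip(w_common, w_source)], x)
decomposed_target = dot([c + t for c, t in zip(w_common, w_target)], x)
```

Any off-the-shelf linear learner can then be trained on the augmented vectors of both domains at once, which is what makes the method easy to apply.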

7. Multi-Modal Learning 

In this section, we discuss the concept of multi-modal learning.
In this setting, correspondences are assumed to be on the instance,
rather than category, level. Also, it is commonly assumed here that
ample train data is available in both domains. Similarly to Sec. 5,
the common goal of most methods is to estimate transformations
$L_t$ and $L_s$ so that
$P_s(X^T L_s = x \mid Y = k) = P_t(X^T L_t = x \mid Y = k)$. This
can be done by letting e.g. $L_s = I$, thus mapping the target
domain to the source domain, or vice versa. One could also consider
mapping both spaces into a common space. We will begin this section
by reviewing Canonical Correlation Analysis, Principal Component
Analysis and Linear Discriminant Analysis. We then consider recent
work utilizing these methods [Sharma et al., 2012].

7.1. Background 

In this section we recap the formulations of Principal Component
Analysis (PCA), Linear Discriminant Analysis (LDA) and Canonical
Correlation Analysis (CCA).

Principal Component Analysis: PCA is a popular dimensionality
reduction method that projects the data onto the directions of
maximum variance. It can be derived as follows. Let
$x_1, \ldots, x_N$ be the input data. Let $w_1$ be the desired
projection direction, with $w_1^T w_1 = 1$. The mean of the
projected data is $w_1^T \bar{x}$, where
$\bar{x} = \frac{1}{N} \sum_i x_i$. The variance of the projected
data is

$$ \mathrm{var}(x) = \frac{1}{N} \sum_{i=1}^{N} (w_1^T x_i - w_1^T \bar{x})^2, \tag{28} $$

which can be expressed in terms of the data covariance matrix,

$$ S = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T, \tag{29} $$
                                [Kulis et al., 2011]   [Gopalan et al., 2011]        [Jhuo et al., 2012]   [Yang et al., 2007]
Source    Target    Naive         asymm      symm        Unsupervised   Supervised     RDALR                 A-SVM
webcam    dslr      22.13 ± 1.2   25         27          19 ± 1.2       37 ± 2.3 *     32.89 ± 1.2           25.96 ± 0.7
dslr      webcam    32.17 ± 0.8   30         31          26 ± 0.8       36 ± 1.1       36.85 ± 1.3 *         33.01 ± 0.8
amazon    webcam    41.29 ± 1.3   48         44          39 ± 2.0       57 ± 3.5 *     50.71 ± 0.8           42.23 ± 0.9

Table 1. Evaluation of the discussed methods on the DA dataset
introduced in [Kulis et al., 2011]. The naive method trains on
$S_l \cup T_l$. The methods are trained on 8 images per category
(if the source is webcam or dslr) or 20 images per category (for
amazon) from the source domain, and 3 images per category from the
target domain. The best result for each experiment is marked with
an asterisk.



as $w_1^T S w_1$. We now maximize the projected variance with
respect to $w_1$. The constrained maximization problem can be
written as

$$ w_1 = \arg\max_{w_1} w_1^T S w_1 \quad \text{s.t.} \quad w_1^T w_1 = 1. \tag{30} $$
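Since Eq. (30) is a regular eigenvalue problem, its solution is the leading eigenvector of $S$. One simple way to approximate it, sketched below without any external library, is power iteration. The function names and the toy data are hypothetical, and the sketch assumes a non-zero covariance matrix and a starting vector that is not orthogonal to the leading eigenvector.

```python
import math


def covariance_matrix(data):
    """Covariance S = (1/N) * sum_i (x_i - mean)(x_i - mean)^T, as in Eq. (29)."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for row in data:
        centered = [row[j] - mean[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += centered[a] * centered[b] / n
    return cov


def first_principal_direction(data, iterations=200):
    """Approximate the unit vector w_1 that maximizes w^T S w by power iteration."""
    cov = covariance_matrix(data)
    d = len(cov)
    w = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length starting point
    for _ in range(iterations):
        # Multiply by S, then renormalize to keep w^T w = 1.
        w = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w


# Hypothetical toy data lying close to the line y = x.
points = [[-2.0, -2.1], [-1.0, -0.9], [0.0, 0.1], [1.0, 0.9], [2.0, 2.0]]
w1 = first_principal_direction(points)
projected_variance = sum(
    w1[a] * covariance_matrix(points)[a][b] * w1[b]
    for a in range(2)
    for b in range(2)
)
```

For this data the recovered direction is close to the diagonal, and the projected variance is close to the total variance of the data, as expected when nearly all variation lies along one line.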



Linear Discriminant Analysis: While PCA is very popular for
unsupervised dimensionality reduction, it is agnostic to class and
might project data in directions that are unsuitable for class
discrimination. Linear Discriminant Analysis (LDA) finds projection
directions by minimizing the within-class scatter while maximizing
the between-class scatter. LDA can be derived as follows. Let
$X_1 = \{x_1^1, \ldots, x_{N_1}^1\}$ and
$X_2 = \{x_1^2, \ldots, x_{N_2}^2\}$ be samples from two different
classes. The data projection direction $w_1$ is given by solving

$$ w_1 = \arg\max_{w_1} w_1^T S_B w_1 \quad \text{s.t.} \quad w_1^T S_W w_1 = 1, \tag{31} $$

where

$$ S_B = (m_1 - m_2)(m_1 - m_2)^T, \qquad S_W = \sum_{i=1}^{2} \sum_{x \in X_i} (x - m_i)(x - m_i)^T \tag{32} $$

are the between-class and within-class scatter matrices, and $m_i$
is the mean of the samples in $X_i$. Note the very similar forms of
Eq. (30) and Eq. (31). Eq. (30) is a regular eigenvalue problem,
while Eq. (31) is a generalized eigenvalue problem.
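In the two-class case the generalized eigenvalue problem of Eq. (31) has a well-known closed-form solution, $w_1 \propto S_W^{-1}(m_1 - m_2)$. The sketch below implements it for two-dimensional data only, so that the matrix inverse can be written out by hand. The function names and the toy data are hypothetical, and $S_W$ is assumed to be non-singular.

```python
import math


def mean_vector(samples):
    """Mean of a list of 2-D samples."""
    n = len(samples)
    return [sum(s[j] for s in samples) / n for j in range(2)]


def within_scatter(classes, means):
    """S_W = sum over classes of sum_x (x - m_i)(x - m_i)^T, as in Eq. (32)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for samples, m in zip(classes, means):
        for x in samples:
            d = [x[0] - m[0], x[1] - m[1]]
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
    return s


def lda_direction(class1, class2):
    """Two-class LDA direction, scaled so that w^T S_W w = 1."""
    m1, m2 = mean_vector(class1), mean_vector(class2)
    sw = within_scatter([class1, class2], [m1, m2])
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    # w is proportional to S_W^{-1} (m_1 - m_2); the 2x2 inverse is explicit.
    w = [
        (sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
        (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det,
    ]
    scale = math.sqrt(
        sum(w[a] * sw[a][b] * w[b] for a in range(2) for b in range(2))
    )
    return [v / scale for v in w]


# Hypothetical toy data: the classes differ only along the first axis,
# while most of the variance lies along the second axis.
class1 = [[0.0, -2.0], [0.0, 2.0], [1.0, -2.0], [1.0, 2.0]]
class2 = [[3.0, -2.0], [3.0, 2.0], [4.0, -2.0], [4.0, 2.0]]
w_lda = lda_direction(class1, class2)
```

On this data PCA would favour the second axis, where the variance is largest, whereas the LDA direction lies along the first axis, which is the one that separates the classes.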

Canonical Correlation Analysis: Developed by [Hotelling, 1936],
CCA is a data analysis and dimensionality reduction method that can
be thought of as a multi-modal extension to PCA. CCA finds basis
vectors for two sets of variables such that the correlations
between the projections of the variables onto these basis vectors
are mutually maximized. Using notation from Sec. 2.1, we let $D^s$
be an $n_s \times d_s$ matrix with rows $x_i^s$, and $D^t$
similarly. CCA finds projection directions $w^s$ and $w^t$ that
maximize the correlation between the projected data. More
formally, it finds projection directions by solving the following
optimization problem

$$ \max_{w^t, w^s} \mathrm{corr}(D^t w^t, D^s w^s) \tag{33} $$
$$ = \max_{w^t, w^s} \frac{\langle D^t w^t, D^s w^s \rangle}{\|D^t w^t\| \, \|D^s w^s\|} \tag{34} $$
$$ = \max_{w^t, w^s} \frac{w^{tT} \Sigma_{ts} w^s}{\sqrt{w^{tT} \Sigma_{tt} w^t \; w^{sT} \Sigma_{ss} w^s}}, \tag{35} $$

where $\Sigma_{ij}$ is the covariance matrix between data in
domains $i$ and $j$, $i, j \in \{s, t\}$. Note that this
formulation requires the same number of samples from each domain,
but not the same dimensionality. We also note that the optimization
can be written as a constrained optimization as



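$$ \max_{w^t, w^s} \begin{bmatrix} w^t \\ w^s \end{bmatrix}^T \begin{bmatrix} 0 & \Sigma_{ts} \\ \Sigma_{st} & 0 \end{bmatrix} \begin{bmatrix} w^t \\ w^s \end{bmatrix} \tag{36} $$
$$ \text{s.t.} \quad \begin{bmatrix} w^t \\ w^s \end{bmatrix}^T \begin{bmatrix} \Sigma_{tt} & 0 \\ 0 & \Sigma_{ss} \end{bmatrix} \begin{bmatrix} w^t \\ w^s \end{bmatrix} = 1. \tag{37} $$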



The optimization problem Eq. (33) (or Eq. (36)) can be formulated
as a generalized eigenvalue problem, which can be solved as
efficiently as regular eigenvalue problems [Hardoon et al., 2004].
For more details, [Hardoon et al., 2004] provide an excellent
analysis of CCA and the kernelized version, KCCA.

7.2. Generalized Multiview Analysis 

GMA was proposed by [Sharma et al., 2012] as a unifying framework
for learning multi-modal discriminative linear projections. They
argue that methods such as PCA and LDA do not handle multi-view
data. On the other hand, methods such as CCA are not supervised.
Other methods, such as SVM-2K [Farquhar et al., 2005], CDSVM [Jiang
et al., 2008] and ASVM [Yang et al., 2007], meet these criteria but
do not generalize well to unseen classes. The unifying framework of
GMA is
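$$ \max_{w^t, w^s} \begin{bmatrix} w^t \\ w^s \end{bmatrix}^T \begin{bmatrix} A_t & \alpha Z_t Z_s^T \\ \alpha Z_s Z_t^T & \mu A_s \end{bmatrix} \begin{bmatrix} w^t \\ w^s \end{bmatrix} \tag{38} $$
$$ \text{s.t.} \quad \begin{bmatrix} w^t \\ w^s \end{bmatrix}^T \begin{bmatrix} B_t & 0 \\ 0 & \gamma B_s \end{bmatrix} \begin{bmatrix} w^t \\ w^s \end{bmatrix} = 1. \tag{39} $$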


PCA can be recovered (in the $i$'th view) through this framework
by setting $A_i = \Sigma_{ii}$, $B_i = I$. Similarly, LDA can be
recovered by setting $A_i = S_B$, $B_i = S_W$, with $S_B$ and $S_W$
defined as above. CCA can be recovered by setting $A_i = 0$,
$B_i = D^i (D^i)^T$ and $Z_i = D^i$.

Using this framework they propose two methods, Generalized
Multiview LDA (GMLDA) and Generalized Multiview Marginal Fisher
Analysis (GMMFA). Here we recap only GMLDA. As noted above, LDA in
the $i$'th view can be achieved by setting $A_i = S_B$,
$B_i = S_W$. By setting $Z_i = M_i$, a matrix whose columns are the
class means, they enforce class mean alignment across views. The
authors also note that the two-step process of LDA + CCA, or vice
versa, needs to be considered as a baseline method. Similar
approaches include [Rasiwasia et al., 2010], who introduced
semantic correlation matching, which uses logistic regression to
combine CCA with semantic matching. They also introduce the
WikiText data set (Table 2).

The proposed method outperforms all baseline methods on MultiPIE
and VOC2007 and is on par with the domain-specific approach by
[Rasiwasia et al., 2010] on the WikiText dataset.



Name              Description                                   Instance or class correspondence
MultiPIE          Face recognition data set containing face     Instance
                  images under different Pose, Illumination
                  and Expression
WikiText          Each item is represented using a text and     Instance
                  an image
Pascal VOC 2007   5011 / 4952 (training / testing) image-tag    Instance
                  pairs
Office dataset    Object images from Amazon, DSLR and webcam    Class (subset with instance)

Table 2. Computer vision datasets for benchmarking domain
adaptation methods.